Spatial audio



Background 

Prior solutions in audio coders that have been suggested to reduce the bitrate 
of stereo program material include: 

'Intensity stereo'. In this algorithm, high frequencies (typically above 5 kHz)
are represented by a single audio signal (i.e., mono), combined with time-varying and
frequency-dependent scalefactors.

'M/S stereo'. In this algorithm, the signal is decomposed into a sum (or mid, or
common) and a difference (or side, or uncommon) signal. This decomposition is sometimes
combined with principal component analysis or time-varying scalefactors. These signals are
then coded independently, either by a transform coder or waveform coder. The amount of
information reduction achieved by this algorithm strongly depends on the spatial properties
of the source signal. For example, if the source signal is monaural, the difference signal is
zero and can be discarded. However, if the correlation of the left and right audio signals is
low (which is often the case), this scheme offers only little advantage.
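As an illustration of the sum/difference decomposition described above, the following is a minimal sketch rather than the method of any particular coder. The function names `ms_encode` and `ms_decode` and the scaling by 1/2 are assumptions made for this example only.

```python
def ms_encode(left, right):
    """Split equal-length left/right sample lists into mid (sum) and side (difference).

    The factor 1/2 is an assumed normalization; other coders may scale differently.
    """
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Invert ms_encode: left = mid + side, right = mid - side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

For a monaural source (identical left and right channels) the side signal returned by `ms_encode` is all zeros, which is the case in which the difference signal can be discarded.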

Parametric descriptions of audio signals have gained interest during the last
years, especially in the field of audio coding. It has been shown that transmitting
parameters that describe audio signals requires only little transmission capacity to
resynthesize a perceptually equal signal at the receiving end. However, current parametric
audio coders focus on coding monaural signals, and stereo signals are often processed as dual
mono.



Invention 

According to an aspect of the invention, spatial attributes of multichannel
audio signals are parameterized. It will be shown that for general audio coding applications,
transmitting these parameters combined with only one monaural audio signal will strongly
reduce the transmission capacity necessary to transmit the stereo signal compared to audio
coders that process the channels independently, while maintaining the original spatial
impression. An important issue is that although people receive waveforms of an auditory
object twice (once by the left ear and once by the right ear), only a single auditory object is
2 22.04.2002
perceived at a certain position and with a certain size (or spatial diffuseness). Therefore, it
seems unnecessary to describe audio signals as two or more (independent) waveforms and it
would be better to describe multichannel audio as a set of auditory objects, each with its own
spatial properties. One difficulty that immediately arises is the fact that it is almost
impossible to automatically separate individual auditory objects from a given ensemble of
auditory objects, for example a musical recording. This problem can be circumvented by not
splitting the program material into individual auditory objects, but rather describing the spatial
parameters in a way that resembles the effective (peripheral) processing of the auditory
system. In particular, the parametric description of multichannel audio presented here is
related to the binaural processing model presented by Breebaart et al. This model aims at
describing the effective signal processing of the binaural auditory system. For a complete
model description, see Breebaart et al. (2001a,b,c). A short interpretation is given below
which helps to understand the invention.

The model splits the incoming audio into several band-limited signals, which
are (preferably) spaced linearly at an ERB-rate scale. The bandwidth of these signals depends
on the center frequency, following the ERB rate. Subsequently, preferably for every
frequency band, the following properties of the incoming signals are analyzed:

- The interaural level difference, or ILD, defined by the relative levels of the band-limited
signal stemming from the left and right ears,

- The interaural time (or phase) difference (ITD or IPD), defined by the interaural delay (or
phase shift) corresponding to the peak of the interaural cross-correlation function, and

- The (dis)similarity of the waveforms that cannot be accounted for by ITDs or ILDs,
which can be parameterized by the maximum interaural cross-correlation (i.e., the value
of the cross-correlation at the position of the maximum peak).

The three parameters described above vary over time; however, since the
binaural auditory system is very sluggish in its processing, the update rate of these properties
is rather low (typically tens of milliseconds).

It may be assumed here that the (slowly) time-varying properties mentioned
above are the only spatial signal properties that the binaural auditory system has available,
and that from these time and frequency dependent parameters, the perceived auditory world
is reconstructed by higher levels of the auditory system.

It is interesting to mention that the ILD and ITD are believed to be the most
important localization cues in the horizontal plane, while the maximum interaural
cross-correlation is strongly related to the perceptual spatial diffuseness (or compactness) of a
sound source.

It is an insight of the inventors that it is sufficient to describe spatial attributes
of any multichannel audio signal by specifying the ILD, ITD (or IPD) and maximum
correlation as a function of time and frequency.

An embodiment of the current invention aims at describing a multichannel
audio signal by:

- one monaural signal, consisting of a certain combination of the input signals, and

- a set of spatial parameters: two localization cues (ILD, and ITD or IPD) and a
parameter that describes the similarity or dissimilarity of the waveforms that cannot be
accounted for by ILDs and/or ITDs (e.g., the maximum of the cross-correlation function),
preferably for every time/frequency slot.

Preferably, spatial parameters are included for each additional auditory channel.

Advantages of this parametric description are the following:

- Decoupling of monaural and binaural signal parameters in audio coders. Difficulties
related to stereo audio coders are strongly reduced (such as the audibility of interaurally
uncorrelated quantization noise compared to interaurally correlated quantization noise).

- Strong bitrate reduction in audio coders due to a low update rate and low frequency
resolution required for the spatial parameters. The associated bitrate to code the spatial
parameters is typically 10 kbit/s or less (see embodiment).

- Easy combination with existing audio coders. The proposed scheme produces one mono
signal that can be coded and decoded with any existing coding strategy. After monaural
decoding, the system described here regenerates the spatial attributes.

The set of spatial parameters can be used as an enhancement layer in audio
coders. For example, a mono signal is transmitted if only a low bitrate is allowed, while by
including the spatial enhancement layer the decoder can reproduce stereo sound.

The invention can, in principle, be used to generate n channels from one mono
signal, if (n-1) sets of spatial parameters are transmitted. In that case, the spatial
parameters describe how to form the n different audio channels from the single mono signal.



Analysis methods 

In the following, it is assumed that the incoming signals are split up into band-pass
signals (preferably with a bandwidth which increases with frequency) and that
parameters can be analyzed as a function of time. A possible method for time/frequency
slicing would be to use time-windowing followed by a transform operation, but
time-continuous methods could also be used (e.g., filterbanks). The next steps consist of (1) finding the
level difference (ILD) of corresponding subband signals, (2) finding the time difference (ITD
or IPD) of corresponding subband signals, and (3) describing the amount of similarity or
dissimilarity of the waveforms which cannot be accounted for by ILDs or ITDs. The analysis
of these parameters is discussed below.

Analysis of ILDs 

The ILD is determined by the level difference of the signals at a certain time
instance for a given frequency band. One method to determine the ILD is to measure the rms
value of the corresponding frequency band of both input channels and compute the ratio of
these rms values (preferably expressed in dB).
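The rms-ratio method above can be sketched as follows. This is an illustrative example only: the names `rms` and `ild_db`, the left-over-right sign convention, and the small `eps` guard against silent bands are assumptions that do not appear in the text.

```python
import math


def rms(samples):
    """Root-mean-square value of a non-empty list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def ild_db(left_band, right_band, eps=1e-12):
    """Level difference in dB between two band-limited signals.

    Assumed convention: positive values mean the left channel is louder.
    eps (an assumption) avoids division by zero for silent bands.
    """
    return 20.0 * math.log10((rms(left_band) + eps) / (rms(right_band) + eps))
```

With this convention, a left subband whose amplitude is twice that of the right subband gives an ILD of about 6 dB.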

Analysis of the ITDs

The ITDs are determined by the time or phase alignment which gives the best
match between the waveforms of both channels. One method to obtain the ITD is to compute
the cross-correlation function between two corresponding subband signals and search for
the maximum. The delay that corresponds to this maximum in the cross-correlation function
can be used as ITD value. A second method would be to compute the analytic signals of the
left and right subband (i.e., computing phase and envelope values) and use the phase
difference between the channels as IPD parameter.
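The first method (peak of the cross-correlation function) might be sketched in the time domain as below. The helper names, the direct O(N·lags) summation, and the sign convention (a positive lag means the left signal is a delayed copy of the right signal) are assumptions chosen for this illustration.

```python
def cross_correlation(left, right, lag):
    """Sum of left[n + lag] * right[n] over all n for which both indices are valid."""
    total = 0.0
    for n in range(len(right)):
        m = n + lag
        if 0 <= m < len(left):
            total += left[m] * right[n]
    return total


def estimate_itd_samples(left, right, max_lag):
    """Return the lag (in samples) within [-max_lag, max_lag] that maximizes
    the cross-correlation. Divide by the sampling rate to obtain seconds."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cross_correlation(left, right, lag))
```

For a left subband signal that equals the right one delayed by three samples, `estimate_itd_samples` returns 3 under the convention stated above.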

Analysis of the correlation 

The correlation is obtained by first finding the ILD and ITD that give the best
match between the corresponding subband signals and subsequently measuring the similarity
of the waveforms after compensation for the ITD and/or ILD. Thus, in this framework, the
correlation is defined as the similarity or dissimilarity of corresponding subband signals
which cannot be attributed to ILDs and/or ITDs. A suitable measure for this parameter is the
maximum value of the cross-correlation function (i.e., the maximum across a set of delays).
However, other measures could also be used, such as the relative energy of the difference
signal after ILD and/or ITD compensation compared to the sum signal of corresponding
subbands (preferably also compensated for ILDs and/or ITDs). This difference parameter is
basically a linear transformation of the (maximum) correlation.
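A minimal sketch of the first measure (the maximum of the cross-correlation function across a set of delays) is given below. Normalizing by the square root of the product of the two subband energies, so that level differences drop out, is an assumption of this example; the function names are hypothetical.

```python
import math


def cross_correlation(left, right, lag):
    """Sum of left[n + lag] * right[n] over all valid n."""
    total = 0.0
    for n in range(len(right)):
        m = n + lag
        if 0 <= m < len(left):
            total += left[m] * right[n]
    return total


def max_normalized_correlation(left, right, max_lag):
    """Maximum cross-correlation over lags in [-max_lag, max_lag], normalized
    by the geometric mean of the two energies (assumed normalization)."""
    energy_left = sum(v * v for v in left)
    energy_right = sum(v * v for v in right)
    norm = math.sqrt(energy_left * energy_right)
    if norm == 0.0:
        return 0.0  # assumed value for a silent subband
    peak = max(cross_correlation(left, right, lag)
               for lag in range(-max_lag, max_lag + 1))
    return peak / norm
```

Under these assumptions, two subband signals that differ only by a delay within the search range and by a level difference yield a value of 1.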
Parameter quantization

An important issue in the transmission of parameters is the accuracy of the
parameter representation (i.e., the size of quantization errors), which is directly related to the
necessary transmission capacity. In this section, several issues with respect to the
quantization of the spatial parameters will be discussed. The basic idea is to base the
quantization errors on so-called just-noticeable differences (JNDs) of the spatial cues. To be
more specific, the quantization error is determined by the sensitivity of the human auditory
system to changes in the parameters. Since it is well known that the sensitivity to changes in
the parameters strongly depends on the values of the parameters themselves, we apply the
following methods to determine the discrete quantization steps.

Quantization of ILDs

It is known from psychoacoustic research that the sensitivity to changes in the
ILD depends on the ILD itself. If the ILD is expressed in dB, deviations of approximately 1
dB from a reference of 0 dB are detectable, while changes in the order of 3 dB are required if
the reference level difference amounts to 20 dB. Therefore, quantization errors can be larger if
the signals of the left and right channels have a larger level difference. For example, this can
be applied by first measuring the level difference between the channels, followed by a
nonlinear (compressive) transformation of the obtained level difference and subsequently a linear
quantization process, or by using a lookup table for the available ILD values which have a
nonlinear distribution. The embodiment below gives an example of such a lookup table.
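The first option (a compressive transformation followed by linear quantization) could look like the sketch below. The logarithmic compression curve, the `knee_db` value of 10 dB and the `step` of 0.1 are assumptions picked for illustration so that the resulting step sizes are roughly 1 dB near 0 dB and roughly 3 dB near 20 dB; the text does not prescribe a particular transformation.

```python
import math


def compress(ild_db, knee_db=10.0):
    """Sign-preserving logarithmic compression (assumed curve, not from the text)."""
    return math.copysign(math.log1p(abs(ild_db) / knee_db), ild_db)


def expand(value, knee_db=10.0):
    """Inverse of compress."""
    return math.copysign(math.expm1(abs(value)) * knee_db, value)


def quantize_ild(ild_db, step=0.1, knee_db=10.0):
    """Uniformly quantize the compressed ILD; return (index, reconstructed ILD in dB)."""
    index = round(compress(ild_db, knee_db) / step)
    return index, expand(index * step, knee_db)
```

Because the uniform grid lives in the compressed domain, neighbouring reconstruction levels are close together for small ILDs and further apart for large ILDs, which is the behaviour the paragraph above asks for.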

Quantization of the correlation

The quantization error of the correlation depends on (1) the correlation value
itself and possibly (2) on the ILD. Correlation values near +1 are coded with a high accuracy
(i.e., a small quantization step), while correlation values near 0 are coded with a low accuracy
(a large quantization step). An example of a set of nonlinearly distributed correlation values
is given in the embodiment. A second possibility is to use quantization steps for the
correlation that depend on the measured ILD of the same subband: for large ILDs (i.e., one
channel is dominant in terms of energy), the quantization errors in the correlation become
larger. An extreme example of this principle would be to not transmit correlation values for a
certain subband at all if the absolute value of the ILD for that subband is beyond a certain
threshold.
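As a sketch of both ideas, the example below quantizes a correlation value to the nearest entry of the nonlinearly distributed set R listed in the embodiment, and skips the correlation entirely for large ILDs. The function names are hypothetical, and the 19 dB threshold is taken from the embodiment; nearest-value search is one straightforward way to use the table, not necessarily the one used in practice.

```python
# Correlation values from the embodiment: dense near 1, sparse near 0.
R = [1.0, 0.95, 0.9, 0.82, 0.75, 0.6, 0.3, 0.0]


def quantize_correlation(r, table=R):
    """Return (index, value) of the table entry closest to r."""
    index = min(range(len(table)), key=lambda i: abs(table[i] - r))
    return index, table[index]


def code_correlation(r, ild_db, ild_threshold_db=19.0):
    """Return None (nothing transmitted) when |ILD| reaches the threshold,
    otherwise the quantized correlation."""
    if abs(ild_db) >= ild_threshold_db:
        return None
    return quantize_correlation(r)
```

With eight table entries the index fits in 3 bits per correlation value.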
Quantization of the ITDs

The sensitivity to changes in the ITDs of human subjects can be characterized
as having a constant phase threshold. This means that in terms of delay times, the
quantization steps for the ITD should decrease with frequency. Alternatively, if the ITD is
represented in the form of phase differences, the quantization steps should be independent of
frequency. One method to implement this would be to take a fixed phase difference as
quantization step and determine the corresponding time delay for each frequency band. This
ITD value is then used as quantization step. Another method would be to transmit phase
differences which follow a frequency-independent quantization scheme. It is also known that
above a certain frequency, the human auditory system is not sensitive to ITDs in the
fine-structure waveforms. This phenomenon can be exploited by only transmitting ITD
parameters up to a certain frequency (typically 2 kHz).
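The fixed-phase-step method can be illustrated as follows. A phase difference of φ radians at frequency f corresponds to a delay of φ / (2πf) seconds, so the delay step shrinks as the band center frequency rises. The 0.1 rad step and the 2 kHz limit are the values used in the embodiment; the function names and the use of the band center frequency are assumptions of this sketch.

```python
import math


def itd_step_seconds(center_hz, phase_step_rad=0.1):
    """Delay that corresponds to a fixed phase step at the band center frequency."""
    return phase_step_rad / (2.0 * math.pi * center_hz)


def quantize_itd(itd_s, center_hz, phase_step_rad=0.1, max_hz=2000.0):
    """Return (index, quantized ITD in seconds), or None above max_hz,
    where no ITD is transmitted."""
    if center_hz > max_hz:
        return None
    step = itd_step_seconds(center_hz, phase_step_rad)
    index = round(itd_s / step)
    return index, index * step
```

For example, at 500 Hz the step is about 32 microseconds, and at 1 kHz it is half of that.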

A third method of bitstream reduction is to incorporate ITD quantization steps
that depend on the ILD and/or the correlation parameters of the same subband. For large
ILDs, the ITDs can be coded less accurately. Furthermore, if the correlation is very low, it is
known that the human sensitivity to changes in the ITD is reduced. Hence larger ITD
quantization errors may be applied if the correlation is small. An extreme example of this
idea is to not transmit ITDs at all if the correlation is below a certain threshold.

Embodiment 

The embodiment for a stereo input signal can be schematically drawn as
shown in Fig. 1.

Fig. 1. Schematic diagram of an embodiment of the invention. In the encoder,
spatial parameters are analyzed preferably for each time/frequency slot. Subsequently, a sum
(or dominant) signal is generated consisting of a certain combination of the at least two input
signals. Synthesis (decoder) is performed by applying the spatial parameters to the sum signal
to generate left and right output signals.

In this embodiment, the spatial parameter description is combined with a
monaural (single channel) audio coder to encode a stereo audio signal. It should be noted that
although the described embodiment works on stereo signals, the general idea can be applied
to n-channel audio signals, with n > 1.

Analysis

The left and right incoming signals are split up into various time frames (2048
samples at 44.1 kHz sampling rate) and windowed with a square-root Hanning window.
Subsequently, FFTs are computed. The negative FFT frequencies are discarded and the
resulting FFTs are subdivided into groups (subbands) of FFT bins. The number of FFT bins
that are combined in a subband g depends on the frequency: at higher frequencies more bins
are combined than at lower frequencies. In the current implementation, FFT bins
corresponding to approximately 1.8 ERBs (Equivalent Rectangular Bandwidth) are grouped,
resulting in 20 subbands to represent the entire audible frequency range. The resulting
number of FFT bins S[g] of each subsequent subband (starting at the lowest frequency) is

S = [4 4 4 5 6 8 9 12 13 17 21 25 30 38 45 55 68 82 100 477]

Thus, the first three subbands contain 4 FFT bins, the fourth subband contains
5 FFT bins, etc. For each subband, the corresponding ILD, ITD and correlation (r) are
computed. The ITD and correlation are computed simply by setting all FFT bins which
belong to other groups to zero, multiplying the resulting (band-limited) FFTs from the left
and right channels, followed by an inverse FFT transform. The resulting cross-correlation
function is scanned for a peak within an interchannel delay between -64 and +63 samples.
The internal delay corresponding to the peak is used as ITD value, and the value of the
cross-correlation function at this peak is used as this subband's interaural correlation. Finally, the
ILD is simply computed by taking the power ratio of the left and right channels for each
subband.
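The grouping of FFT bins into subbands and the per-subband power ratio can be sketched as below. The text does not say at which bin the first subband starts, so `first_bin=0` is an assumption, as are the function names and the `eps` guard. The listed sizes add up to 1023 bins.

```python
import math

# Number of FFT bins per subband, as listed above (lowest frequency first).
S = [4, 4, 4, 5, 6, 8, 9, 12, 13, 17, 21, 25, 30, 38, 45, 55, 68, 82, 100, 477]


def subband_ranges(sizes=S, first_bin=0):
    """Return (start, stop) bin index pairs, stop exclusive, one per subband."""
    ranges = []
    start = first_bin  # assumed starting bin
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges


def subband_ild_db(left_bins, right_bins, sizes=S, eps=1e-12):
    """ILD per subband as the left/right power ratio in dB.

    left_bins and right_bins are sequences of complex FFT bins for the
    non-negative frequencies of one frame.
    """
    ilds = []
    for start, stop in subband_ranges(sizes):
        power_left = sum(abs(c) ** 2 for c in left_bins[start:stop])
        power_right = sum(abs(c) ** 2 for c in right_bins[start:stop])
        ilds.append(10.0 * math.log10((power_left + eps) / (power_right + eps)))
    return ilds
```

The same `(start, stop)` ranges can be used to zero all bins outside a subband before the cross-correlation step.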

Generation of the sum signal 

The left and right subbands are summed after a phase correction (temporal
alignment). This phase correction follows from the computed ITD for that subband and
consists of delaying the left-channel subband with ITD/2 and the right-channel subband with
-ITD/2. The delay is performed in the frequency domain by appropriate modification of the
phase angles of each FFT bin. Subsequently, the sum signal is computed by adding the
phase-modified versions of the left and right subband signals. Finally, to compensate for
uncorrelated or correlated addition, each subband of the sum signal is multiplied with
sqrt(2/(1+r)), with r the correlation of the corresponding subband. If necessary, the sum
signal can be converted to the time domain by (1) inserting complex conjugates at negative
frequencies, (2) inverse FFT, (3) windowing, and (4) overlap-add.
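A sketch of these steps for a single subband is shown below. A delay of d seconds is applied to a bin at frequency f by multiplying it with exp(-j·2π·f·d). The function names and the representation of a subband as parallel lists of complex bins and bin frequencies are assumptions; the sketch also assumes r > -1 so that the scale factor is defined.

```python
import cmath
import math


def delay_bins(bins, bin_freqs_hz, delay_s):
    """Delay a set of FFT bins by delay_s seconds via a linear phase shift."""
    return [c * cmath.exp(-2j * math.pi * f * delay_s)
            for c, f in zip(bins, bin_freqs_hz)]


def sum_subband(left_bins, right_bins, bin_freqs_hz, itd_s, r):
    """Align left/right by +ITD/2 and -ITD/2, add, and scale by sqrt(2/(1+r))."""
    left_aligned = delay_bins(left_bins, bin_freqs_hz, itd_s / 2.0)
    right_aligned = delay_bins(right_bins, bin_freqs_hz, -itd_s / 2.0)
    gain = math.sqrt(2.0 / (1.0 + r))  # requires r > -1
    return [gain * (a + b) for a, b in zip(left_aligned, right_aligned)]
```

For fully correlated subbands (r = 1) the gain is 1, and for uncorrelated subbands (r = 0) it is sqrt(2).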
Quantization of spatial parameters

ILDs (in dB) are quantized to the closest value out of the following set I:

I = [-19 -16 -13 -10 -8 -6 -4 -2 0 2 4 6 8 10 13 16 19]

ITD quantization steps are determined by a constant phase difference in each subband of 0.1
rad. Thus, for each subband, the time difference that corresponds to 0.1 rad of the subband
center frequency is used as quantization step. For frequencies above 2 kHz, no ITD
information is transmitted.

Interaural correlation values r are quantized to the closest value of the
following ensemble R:

R = [1 0.95 0.9 0.82 0.75 0.6 0.3 0]

This will cost another 3 bits per correlation value.

If the absolute value of the (quantized) ILD of the current subband amounts to 19
dB, no ITD and correlation values are transmitted for this subband. If the (quantized)
correlation value of a certain subband amounts to zero, no ITD value is transmitted for that
subband.
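The table lookup for the ILD and the two suppression rules above, together with the 2 kHz ITD limit stated earlier, can be combined in a small decision function. The names `quantize_ild` and `parameters_to_send`, the dictionary layout and the use of the subband center frequency for the 2 kHz test are assumptions made for this sketch.

```python
# ILD reconstruction levels in dB, as listed in set I above.
I = [-19, -16, -13, -10, -8, -6, -4, -2, 0, 2, 4, 6, 8, 10, 13, 16, 19]


def quantize_ild(ild_db, table=I):
    """Return the table value closest to ild_db."""
    return min(table, key=lambda v: abs(v - ild_db))


def parameters_to_send(ild_db, r_quantized, center_hz):
    """Return (quantized ILD, flags saying which parameters are transmitted)."""
    ild_q = quantize_ild(ild_db)
    send = {"ild": True, "itd": True, "correlation": True}
    if abs(ild_q) == 19:
        # One channel dominates: neither ITD nor correlation is transmitted.
        send["itd"] = False
        send["correlation"] = False
    elif r_quantized == 0:
        # Quantized correlation of zero: no ITD is transmitted.
        send["itd"] = False
    if center_hz > 2000.0:
        # No ITD information above 2 kHz.
        send["itd"] = False
    return ild_q, send
```

Since set I has 17 entries, an ILD index needs 5 bits when coded with a fixed-length code.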

In this way, each frame requires a maximum of 233 bits to transmit the spatial
parameters. With a frame length of 1024 samples, the maximum bitrate for transmission
amounts to 10.25 kbit/s. It should be noted that using entropy coding or differential coding, this
bitrate can be reduced further.

Synthesis 

In this part, it is assumed that the frequency-domain representation of the sum
signal as described in the analysis section is available for processing. This representation may
be obtained by windowing and FFT operations of the time-domain waveform. First, the sum
signal is copied to the left and right output signals. Subsequently, the correlation between the
left and right signals is modified with a decorrelator. Subsequently, each subband of the left
signal is delayed by -ITD/2, and the right signal is delayed by ITD/2 given the (quantized)
ITD corresponding to that subband. Finally, the left and right subbands are scaled according
to the ILD for that subband. To convert the output signals to the time domain, the following
steps have to be performed: (1) inserting complex conjugates at negative frequencies, (2)
inverse FFT, (3) windowing, and (4) overlap-add.
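The delay and scaling steps of the synthesis can be sketched for one subband as follows. The decorrelator is left out because the text does not specify it. The way the ILD is split into two gains (left/right amplitude ratio equal to the ILD, with the sum of the squared gains fixed at 2) is an assumption of this example, as are the function name and the left-over-right ILD convention.

```python
import cmath
import math


def synthesize_subband(sum_bins, bin_freqs_hz, itd_s, ild_db):
    """Generate left/right subband bins from the sum signal.

    Decorrelation is omitted in this sketch. Gains are an assumed split:
    g_left / g_right = 10**(ild_db / 20) and g_left**2 + g_right**2 = 2.
    """
    ratio = 10.0 ** (ild_db / 20.0)
    g_right = math.sqrt(2.0 / (1.0 + ratio * ratio))
    g_left = ratio * g_right
    # A delay of d seconds multiplies a bin at frequency f by exp(-j*2*pi*f*d).
    # Left is delayed by -ITD/2, right by +ITD/2.
    left = [g_left * c * cmath.exp(1j * math.pi * f * itd_s)
            for c, f in zip(sum_bins, bin_freqs_hz)]
    right = [g_right * c * cmath.exp(-1j * math.pi * f * itd_s)
             for c, f in zip(sum_bins, bin_freqs_hz)]
    return left, right
```

With an ILD of 0 dB and an ITD of 0 both outputs equal the sum signal, which corresponds to the initial copy step described above.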

In summary, this application describes a psycho-acoustically motivated,
parametric description of the spatial attributes of multichannel audio signals. This parametric
description allows strong bitrate reductions in audio coders, since only one monaural signal
has to be transmitted, combined with (quantized) parameters which describe the spatial
properties of the signal. The decoder can form the original number of audio channels by
applying the spatial parameters. For near-CD-quality stereo audio, a bitrate associated with
these spatial parameters of 10 kbit/s or less seems sufficient to reproduce the correct spatial
impression at the receiving end.
1. A method of coding an audio signal, the method comprising:

generating a monaural signal comprising a certain combination of at least two
input audio channels,

analyzing spatial parameters of the at least two input audio channels,
preferably for each time/frequency slot, to obtain a set of spatial parameters preferably for
every time/frequency slot, the set including at least two localization cues (e.g. ILD, and ITD
or IPD) and a parameter that describes a similarity or dissimilarity of waveforms that cannot
be accounted for by the localization cues, the parameter being e.g. a maximum of a
cross-correlation function, and

generating an encoded signal comprising the monaural signal and the set of
spatial parameters.

2. An encoder for coding an audio signal, the encoder comprising:

means for generating a monaural signal comprising a certain combination of at
least two input audio channels,

means for analyzing spatial parameters of the at least two input audio
channels, preferably for each time/frequency slot, to obtain a set of spatial parameters
preferably for every time/frequency slot, the set including at least two localization cues (e.g.
ILD, and ITD or IPD) and a parameter that describes a similarity or dissimilarity of
waveforms that cannot be accounted for by the localization cues, the parameter being e.g. a
maximum of a cross-correlation function, and

means for generating an encoded signal comprising the monaural signal and
the set of spatial parameters.

3. An apparatus for supplying an audio signal, the apparatus comprising:

an input for receiving an audio signal,

an encoder as claimed in claim 2 for encoding the audio signal to obtain an
encoded audio signal, and

an output for supplying the encoded audio signal.
4. An encoded audio signal, the signal comprising:

a monaural signal comprising a certain combination of at least two audio
channels, and

a set of spatial parameters, preferably for every time/frequency slot, the set
including at least two localization cues (e.g. ILD, and ITD or IPD) and a parameter that
describes a similarity or dissimilarity of waveforms that cannot be accounted for by the
localization cues, the parameter being e.g. a maximum of a cross-correlation function.



5. A storage medium on which an encoded signal as claimed in claim 4 has been
stored.

6. A method of decoding an encoded audio signal, the method comprising:

obtaining a monaural signal from the encoded audio signal, the monaural
signal comprising a certain combination of at least two audio channels, and

obtaining a set of spatial parameters from the encoded audio signal, preferably
for every time/frequency slot, the set including at least two localization cues (e.g. ILD, and
ITD or IPD) and a parameter that describes a similarity or dissimilarity of waveforms that
cannot be accounted for by the localization cues, the parameter being e.g. a maximum of a
cross-correlation function, and

applying the spatial parameters to the monaural signal or the at least two audio
channels to generate a multi-channel output signal.



7. A decoder for decoding an encoded audio signal, the decoder comprising:

means for obtaining a monaural signal from the encoded audio signal, the
monaural signal comprising a certain combination of at least two audio channels, and

means for obtaining a set of spatial parameters from the encoded audio signal,
preferably for every time/frequency slot, the set including at least two localization cues (e.g.
ILD, and ITD or IPD) and a parameter that describes a similarity or dissimilarity of
waveforms that cannot be accounted for by the localization cues, the parameter being e.g. a
maximum of a cross-correlation function, and

means for applying the spatial parameters to the monaural signal or the at least
two audio channels to generate a multi-channel output signal.
8. An apparatus for supplying a decoded audio signal, the apparatus comprising:

an input for receiving an encoded audio signal,

a decoder as claimed in claim 7 for decoding the encoded audio signal to
obtain a multi-channel output signal,

an output for supplying or reproducing the multi-channel output signal.

ABSTRACT:

In summary, this application describes a psycho-acoustically motivated,
parametric description of the spatial attributes of multichannel audio signals. This parametric
description allows strong bitrate reductions in audio coders, since only one monaural signal
has to be transmitted, combined with (quantized) parameters which describe the spatial
properties of the signal. The decoder can form the original number of audio channels by
applying the spatial parameters. For near-CD-quality stereo audio, a bitrate associated with
these spatial parameters of 10 kbit/s or less seems sufficient to reproduce the correct spatial
impression at the receiving end.